b-Bit Minwise Hashing in Practice: Large-Scale Batch and Online Learning and Using GPUs for Fast Preprocessing with Simple Hash Functions
نویسندگان
چکیده
ABSTRACT Minwise hashing is a standard technique in the context of search for approximating set similarities. The recent work [27] demonstrated a potential use of b-bit minwise hashing [26] for batch learning on large data. However, several critical issues must be tackled before one can apply b-bit minwise hashing to the volumes of data often used industrial applications, especially in the context of search.
منابع مشابه
b-Bit Minwise Hashing for Large-Scale Learning
Abstract Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and ...
متن کاملFast Near Neighbor Search in High-Dimensional Binary Data
Numerous applications in search, databases, machine learning, and computer vision, can benefit from efficient algorithms for near neighbor search. This paper proposes a simple framework for fast near neighbor search in high-dimensional binary data, which are common in practice (e.g., text). We develop a very simple and effective strategy for sub-linear time near neighbor search, by creating has...
متن کاملOne Permutation Hashing
Abstract Minwise hashing is a standard procedure in the context of search, for efficiently estimating set similarities in massive binary data such as text. Recently, b-bit minwise hashing has been applied to large-scale learning and sublinear time nearneighbor search. The major drawback of minwise hashing is the expensive preprocessing, as the method requires applying (e.g.,) k = 200 to 500 per...
متن کاملb-Bit Minwise Hashing for Large-Scale Linear SVM
Linear Support Vector Machines (e.g., SVM, Pegasos, LIBLINEAR) are powerful and extremely efficient classification tools when the datasets are very large and/or highdimensional, which is common in (e.g.,) text classification. Minwise hashing is a popular technique in the context of search for computing resemblance similarity between ultra high-dimensional (e.g., 2) data vectors such as document...
متن کاملOne Permutation Hashing for Efficient Search and Learning
Minwise hashing is a standard procedure in the context of search, for efficiently estimating set similarities in massive binary data such as text. Recently, the method of b-bit minwise hashing has been applied to large-scale linear learning (e.g., linear SVM or logistic regression) and sublinear time near-neighbor search. The major drawback of minwise hashing is the expensive preprocessing cost...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- CoRR
دوره abs/1205.2958 شماره
صفحات -
تاریخ انتشار 2012